Tolerating Correlated Failures in Wide-Area Monitoring Services

Authors

  • Suman Nath
  • Haifeng Yu
  • Phillip B. Gibbons
  • Srinivasan Seshan
Abstract

Recently, there has been increasing interest in systems (e.g., Content Delivery Networks, Distributed Hash Tables, distributed file systems) running in large, wide-area distributed environments (e.g., Akamai, PlanetLab, RON). A key challenge to making these systems robust is the presence of correlated failures. In the context of a customizable wide-area monitoring service we developed, this paper is the first to comprehensively study the negative effects of failure correlation on availability and how to mitigate these effects. To achieve our availability goals, our service incorporates a replication design based on Signed Quorum Systems and a load-balancing design based on a novel database fragmentation algorithm, POST. Through extensive live deployment on PlanetLab, trace-driven emulation and simulation, and a model-based sensitivity analysis, we observe that (i) failure correlation is significant and has dramatic negative effects on availability; (ii) correlation results in significantly diminishing returns for replication; and (iii) our solutions are effective in mitigating the negative effects of correlation.

Similar resources

Subtleties in Tolerating Correlated Failures

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using a combination of experimental and mathematical analysis of several real-world fa...

Subtleties in Tolerating Correlated Failures in Wide-area Storage Systems

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questi...

PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Detecting network path anomalies generally requires examining large volumes of traffic data to find misbehavior. We observe that wide-area services, such as peer-to-peer systems and content distribution networks, exhibit large traffic volumes, spread over large numbers of geographically dispersed endpoints. This makes them ideal candidates for observing wide-area network behavior. Specifically, ...

Neuron - A Wide-Area Service Discovery Infrastructure

A wide-area service discovery infrastructure provides a repository in which services over a wide area can register themselves and clients everywhere can inquire about them. In this paper, we discuss how to build such an infrastructure based on the peer-to-peer model. The proposed system, called Neuron, can be executed on top of a set of federated nodes across the global network and aggregate th...

Modeling with dependent failures

My broad research interest is in dependable systems, in particular developing fault-tolerant distributed algorithms and applying them to practical problems. Developing dependable systems is an important goal as we increasingly rely upon large-scale wide-area distributed systems to support a wide range of online services. As systems scale in size and extent, efficiently coping with failures is a...

Publication date: 2004